
    Reinforcement learning in continuous state spaces

    Reinforcement learning is a technique that allows intelligent behaviours to be implemented automatically, without the need to introduce knowledge or models about the domain. Most reinforcement learning theory is grounded in dynamic programming and, hence, in value functions. These functions provide information about how good it is, with respect to solving a defined task, to be in a given situation in the domain (typically named a state), or how good it is to execute a given action when the system is in a given state. They are used to represent the action policy that must guide the behaviour of the system. However, the traditional implementation of value functions as look-up tables is not practical when the state space is very large, or even infinite. In such situations, generalization methods must be applied to extrapolate the experience acquired for a limited set of states to the whole space, so that optimal behaviours can be achieved even when the whole domain has not been explored. Two main approaches can be found in the literature. On the one hand, there are methods based on learning an adequate discretization of the state space, so the continuous state space is mapped to a finite, reduced one. On the other hand, there are methods that implement the value functions with some supervised function-approximation technique, for instance a neural network. This dissertation develops reinforcement learning methods that can be applied in domains with continuous state spaces. Starting from the two approaches above, it merges the advantages of both into an effective and efficient method that makes learning a fully automatic process in which the designer has to introduce as little information as possible about the task to solve.
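    As a concrete illustration of the second approach only (not the thesis's own method), the following minimal Python sketch replaces a tabular Q-function with a linear approximator over radial-basis features, so a temporal-difference update on one state generalizes to nearby states. The domain, feature map, and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Minimal sketch, assuming a 1-D continuous state and 3 discrete actions:
# Q(s, a) = w[a] . phi(s), updated by gradient-descent TD instead of a
# look-up table, so nearby states share experience.

N_ACTIONS = 3
N_FEATURES = 8

def features(state):
    """Map a continuous scalar state in [-1, 1] to radial-basis features."""
    centers = np.linspace(-1.0, 1.0, N_FEATURES)
    return np.exp(-((state - centers) ** 2) / 0.1)

# One weight vector per action.
w = np.zeros((N_ACTIONS, N_FEATURES))

def q_values(state):
    return w @ features(state)

def td_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning step on the linear approximator."""
    target = r + gamma * np.max(q_values(s_next))
    td_error = target - q_values(s)[a]
    w[a] += alpha * td_error * features(s)
```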

    Probabilistic policy reuse for safe reinforcement learning

    This work introduces Policy Reuse for Safe Reinforcement Learning, an algorithm that combines Probabilistic Policy Reuse and teacher advice for safe exploration in dangerous, continuous state and action reinforcement learning problems in which the dynamic behavior is reasonably smooth and the space is Euclidean. The algorithm uses a continuously increasing monotonic risk function that allows for the identification of the probability of ending up in failure from a given state. Such a risk function is defined in terms of how far such a state is from the state space known by the learning agent. Probabilistic Policy Reuse is used to safely balance the exploitation of actual learned knowledge, the exploration of new actions, and the request of teacher advice in parts of the state space considered dangerous. Specifically, the pi-reuse exploration strategy is used. Using experiments in the helicopter hover task and a business management problem, we show that the pi-reuse exploration strategy can be used to completely avoid the visit to undesirable situations while maintaining the performance (in terms of the classical long-term accumulated reward) of the final policy achieved. This paper has been partially supported by the Spanish Ministerio de Economía y Competitividad TIN2015-65686-C5-1-R and the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement No. 730086 (ERGO). Javier García is partially supported by the Comunidad de Madrid (Spain) funds under the project 2016-T2/TIC-1712
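    A minimal sketch of how a pi-reuse-style action selector with a distance-based risk gate might look, assuming a Euclidean state space. The risk function, thresholds, and the past/greedy/teacher policies below are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def risk(state, known_states):
    """Monotonic risk: grows with Euclidean distance to the known region."""
    d = min(np.linalg.norm(state - k) for k in known_states)
    return 1.0 - np.exp(-d)          # in [0, 1)

def select_action(state, past_policy, greedy_policy, teacher, known_states,
                  n_actions=3, psi=0.5, risk_threshold=0.3, epsilon=0.1):
    """Mix teacher advice, policy reuse, exploration, and exploitation."""
    if risk(state, known_states) > risk_threshold:
        return teacher(state)        # ask the teacher in risky regions
    if rng.random() < psi:
        return past_policy(state)    # reuse the past (safe) policy
    if rng.random() < epsilon:
        return rng.integers(n_actions)   # explore a new action
    return greedy_policy(state)      # exploit current learned knowledge
```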

    Learning Pedagogical Policies from Few Training Data

    [Poster of] 17th European Conference on Artificial Intelligence (ECAI'06). Workshop on Planning, Learning and Monitoring with Uncertainty and Dynamic Worlds, Riva del Garda, Italy, August 8, 2006. Learning a pedagogical policy in an Adaptive and Intelligent Educational System (AIES) fits as a Reinforcement Learning (RL) problem. However, learning pedagogical policies requires acquiring a huge amount of experience interacting with students, so applying RL to the AIES from scratch is infeasible. In this paper we describe RLATES, an AIES that uses RL to learn an accurate pedagogical policy to teach a Database Design course. To reduce the experience required to learn the pedagogical policy, we propose to use an initial value function learned with simulated students, whose model is provided by an expert as a Markov Decision Process. Empirical results demonstrate that the value function learned with the simulated students and transferred to the AIES is a very accurate initial pedagogical policy. The evaluation is based on the interaction of more than 70 Computer Science undergraduate students, and demonstrates that an efficient guide through the contents of the educational system is obtained. This work was supported by the project GPS (TIN2004/07083)
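    The transfer scheme described above can be sketched as follows: a Q-table is first learned against a simulated student (the expert-provided MDP) and then seeds a much shorter learning phase with real students. The state/action sizes, parameters, and the `step` interface are assumptions for illustration, not RLATES code.

```python
import numpy as np

N_STATES, N_ACTIONS = 50, 5
rng = np.random.default_rng(2)

def q_learning(step, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2,
               q_init=None):
    """Tabular Q-learning; `step(s, a) -> (s_next, reward, done)`."""
    q = np.zeros((N_STATES, N_ACTIONS)) if q_init is None else q_init.copy()
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = (rng.integers(N_ACTIONS) if rng.random() < epsilon
                 else int(np.argmax(q[s])))
            s_next, r, done = step(s, a)
            q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
            s = s_next
    return q

# Phase 1: learn against the simulated student; Phase 2: seed the real AIES.
# q_sim = q_learning(simulated_student_step)              # hypothetical env
# q_real = q_learning(real_student_step, episodes=50, q_init=q_sim)
```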

    Reinforcement learning of pedagogical policies in adaptive and intelligent educational systems

    In an adaptive and intelligent educational system (AIES), the process of learning pedagogical policies according to the students' needs fits as a Reinforcement Learning (RL) problem. Previous works have demonstrated that a great amount of experience is needed for the system to learn to teach properly, so applying RL to the AIES from scratch is unfeasible. Other works have previously demonstrated, in a theoretical way, that seeding the AIES with an initial value function learned with simulated students reduces the experience required to learn an accurate pedagogical policy. In this paper we present empirical results demonstrating that a value function learned with simulated students can provide the AIES with a very accurate initial pedagogical policy. The evaluation is based on the interaction of more than 70 Computer Science undergraduate students, and demonstrates that an efficient and useful guide through the contents of the educational system is obtained.

    Emergent behaviors and scalability for multi-agent reinforcement learning-based pedestrian models

    This paper analyzes the emergent behaviors of pedestrian groups that learn through the multi-agent reinforcement learning model developed in our group. Five scenarios studied in the pedestrian-model literature, with different levels of complexity, were simulated in order to analyze the robustness and the scalability of the model. Firstly, a reduced group of agents must learn by interaction with the environment in each scenario; in this phase, each agent learns its own kinematic controller, which will drive it at simulation time. Secondly, the number of simulated agents is increased in each scenario where agents have previously learnt, to test the appearance of emergent macroscopic behaviors without additional learning. This strategy allows us to evaluate the robustness, consistency, and quality of the learned behaviors. For this purpose, several tools from pedestrian dynamics, such as fundamental diagrams and density maps, are used. The results reveal that the developed model is capable of simulating human-like micro and macro pedestrian behaviors for the simulation scenarios studied, including those where the number of pedestrians has been scaled by one order of magnitude with respect to the situation learned. This work has been supported by grant TIN2015-65686-C5-1-R of Ministerio de Economía y Competitividad
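    As an illustration of one analysis tool named above, the sketch below computes a fundamental diagram (mean walking speed as a function of local density) from simulated trajectories. The measurement area, binning, and data layout are arbitrary assumptions, not the paper's exact measurement protocol.

```python
import numpy as np

def fundamental_diagram(positions, velocities, area=100.0, n_bins=10):
    """positions, velocities: one (n_agents_in_area, 2) array per time step.

    Returns (density, mean speed) pairs, one per occupied density bin.
    """
    density = np.array([len(p) / area for p in positions])     # agents / m^2
    speed = np.array([np.linalg.norm(v, axis=1).mean()         # mean |v|
                      for v in velocities])
    bins = np.linspace(density.min(), density.max(), n_bins + 1)
    idx = np.clip(np.digitize(density, bins) - 1, 0, n_bins - 1)
    return [(0.5 * (bins[i] + bins[i + 1]), speed[idx == i].mean())
            for i in range(n_bins) if np.any(idx == i)]
```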

    Integrating Planning, Execution, and Learning to Improve Plan Execution

    Algorithms for planning under uncertainty require accurate action models that explicitly capture the uncertainty of the environment. Unfortunately, obtaining these models is usually complex: in environments with uncertainty, actions may produce countless outcomes, and specifying them and their probabilities is a hard task. As a consequence, when implementing agents with planning capabilities, practitioners frequently opt for architectures that interleave classical planning and execution monitoring, following a replan-on-failure paradigm. Though this approach is more practical, it may produce fragile plans that need continuous replanning episodes or, even worse, that result in execution dead-ends. In this paper, we propose a new architecture to relieve these shortcomings. The architecture is based on the integration of a relational learning component with the traditional planning and execution monitoring components. The new component allows the architecture to learn probabilistic rules of the success of actions from the execution of plans and to automatically upgrade the planning model with these rules. The upgraded models can be used by any classical planner that handles metric functions or, alternatively, by any probabilistic planner. This architecture proposal is designed to integrate off-the-shelf, interchangeable planning and learning components, so it can profit from the latest advances in both fields without modifying the architecture.
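    A minimal sketch of the underlying idea, under assumed data structures (the names and the Laplace smoothing are illustrative, not the paper's relational rules): count the observed outcomes of each action, estimate its success probability, and expose that estimate as a metric cost a classical planner can minimize.

```python
from collections import defaultdict
import math

# (action, context) -> [successes, trials], filled in during plan execution.
stats = defaultdict(lambda: [0, 0])

def record_execution(action, context, succeeded):
    """Update outcome counts after executing one plan step."""
    s = stats[(action, context)]
    s[0] += int(succeeded)
    s[1] += 1

def success_probability(action, context):
    """Laplace-smoothed success estimate (0.5 prior for unseen pairs)."""
    s, n = stats[(action, context)]
    return (s + 1) / (n + 2)

def metric_cost(action, context):
    """Cost for a metric planner: reliable actions become cheap."""
    return -math.log(success_probability(action, context))
```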

    A taxonomy for similarity metrics between Markov decision processes

    Although the notion of task similarity is potentially interesting in a wide range of areas such as curriculum learning or automated planning, it has mostly been tied to transfer learning. Transfer is based on the idea of reusing the knowledge acquired in the learning of a set of source tasks in a new learning process on a target task, assuming that the target and source tasks are close enough. In recent years, transfer learning has succeeded in making reinforcement learning (RL) algorithms more efficient (e.g., by reducing the number of samples needed to achieve (near-)optimal performance). Transfer in RL is based on the core concept of similarity: whenever the tasks are similar, the transferred knowledge can be reused to solve the target task and significantly improve the learning performance. Therefore, the selection of good metrics to measure these similarities is a critical aspect when building transfer RL algorithms, especially when this knowledge is transferred from simulation to the real world. In the literature there are many metrics to measure the similarity between MDPs and, hence, many definitions of similarity, or its complement, distance, have been considered. In this paper, we propose a categorization of these metrics and analyze the definitions of similarity proposed so far, taking into account such categorization. We also follow this taxonomy to survey the existing literature, as well as suggesting future directions for the construction of new metrics. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work has also been supported by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M17), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation)
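    To make the notion concrete, here is a deliberately naive example of one point in the design space such a taxonomy covers: a model-based distance between two finite MDPs that share state and action sets, comparing transition and reward models directly. Metrics in the literature (e.g., bisimulation metrics) are considerably more involved; the weights and norms here are arbitrary choices.

```python
import numpy as np

def mdp_distance(P1, R1, P2, R2, c_t=0.5, c_r=0.5):
    """Naive model-based distance between two finite MDPs.

    P1, P2: (S, A, S) transition tensors; R1, R2: (S, A) reward matrices.
    Smaller values suggest knowledge transfers more readily between tasks.
    """
    d_trans = np.abs(P1 - P2).sum(axis=2).max()   # worst-case L1 gap over (s, a)
    d_rew = np.abs(R1 - R2).max()                 # worst-case reward gap
    return c_t * d_trans + c_r * d_rew
```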

    On-line case-based policy learning for automated planning in probabilistic environments

    Many robotic control architectures perform a continuous cycle of sensing, reasoning and acting, where the reasoning can be carried out in a reactive or deliberative form. Reactive methods are fast and provide the robot with high interaction and response capabilities. Deliberative reasoning is particularly suitable in robotic systems because it employs some form of forward projection (reasoning in depth about goals, preconditions, resources and timing constraints) and provides the robot with reasonable responses in situations unforeseen by the designer. However, this reasoning, typically conducted using Artificial Intelligence techniques like Automated Planning (AP), is not effective for controlling autonomous agents that operate in complex and dynamic environments. Deliberative planning, although feasible in stable situations, takes too long in unexpected or changing situations that require re-planning. Therefore, planning cannot be done on-line in many complex robotic problems, where quick responses are frequently required. In this paper, we propose an alternative approach based on case-based policy learning that integrates deliberative reasoning through AP and reactive response times through reactive planning policies. The method is based on learning planning knowledge from actual experiences to obtain a case-based policy. The contribution of this paper is twofold. First, it is shown that the learned case-based policy produces reasonable and timely responses in complex environments. Second, it is also shown how one case-based policy that solves a particular problem can be reused to solve a similar but more complex problem, in a transfer learning scope. This paper has been partially supported by the Spanish Ministerio de Economía y Competitividad TIN2015-65686-C5-1-R and the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement No. 730086 (ERGO)
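    A sketch of the retrieval step such a case-based policy might use, under assumptions: each case stores a state feature vector and the action the planner chose there, and at execution time the nearest stored case answers reactively, falling back to deliberative planning when nothing similar is known. The distance function and threshold are illustrative, not the paper's.

```python
import numpy as np

class CaseBasedPolicy:
    """Nearest-neighbour case retrieval over (state, action) pairs."""

    def __init__(self, threshold=1.0):
        self.cases = []               # list of (state vector, action)
        self.threshold = threshold    # max distance for a case to apply

    def retain(self, state, action):
        """Store a case learned from an actual planning experience."""
        self.cases.append((np.asarray(state, dtype=float), action))

    def retrieve(self, state):
        """Return the nearest case's action, or None to trigger replanning."""
        state = np.asarray(state, dtype=float)
        best = min(self.cases,
                   key=lambda c: np.linalg.norm(c[0] - state),
                   default=None)
        if best is None or np.linalg.norm(best[0] - state) > self.threshold:
            return None               # no similar case: deliberate instead
        return best[1]
```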

    A three-layer planning architecture for the autonomous control of rehabilitation therapies based on social robots

    This manuscript describes a novel cognitive architecture called NAOTherapist, which provides a social robot with enough autonomy to carry out a non-contact upper-limb rehabilitation therapy for patients with physical impairments, such as cerebral palsy and obstetric brachial plexus palsy. NAOTherapist comprises three levels of Automated Planning. In the high-level planning, the physician establishes the parameters of the therapy, such as the scheduling of the sessions, the therapeutic objectives to be achieved and certain constraints based on the medical records of the patient; this information is used to establish a customized therapy plan. The objective of the medium-level planning is to execute and monitor every previously planned session with the humanoid robot. Finally, the low-level planning involves the execution of path-planning actions by the robot to carry out different low-level instructions, such as performing poses. The technical evaluation shows an accurate definition and monitoring of the therapies and sessions, and a fluent interaction with the robot. This automated process is expected to save time for the professionals while guaranteeing the medical criteria. This work is partially funded by grants TIN2015-65686-C5-1-R and TIN2012-38079-C03-02 of the Spanish Ministerio de Economía y Competitividad
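    A schematic sketch of how the three planning levels could compose; the function names, data structures, and session loop are assumptions made for illustration, not NAOTherapist's actual interfaces.

```python
def plan_therapy(patient_record, objectives, constraints):
    """High level: the physician's parameters become a customized plan."""
    return [{"session": i, "poses": list(objectives)} for i in range(3)]

def run_session(session, robot):
    """Medium level: execute and monitor one previously planned session."""
    for pose in session["poses"]:
        robot.perform_pose(pose)   # low level: path-planned pose execution
        # ...monitor the patient's imitation of the pose and adapt...

class StubRobot:
    """Stand-in for the humanoid robot's low-level controller."""
    def perform_pose(self, pose):
        print(f"performing pose: {pose}")

for session in plan_therapy({}, ["arm_raise", "arm_extend"], {}):
    run_session(session, StubRobot())
```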

    Innovation policies in tourism and cluster development: managerial perceptions in the Agrupaciones Empresariales Innovadoras (AEIs) programme

    This paper is framed within the analysis of innovation policies in tourism. It studies the use of cluster policies and the configuration of clusters as planned initiatives, focusing on the analysis of the tourist Innovative Business Group (Agrupaciones Empresariales Innovadoras, AEIs) Programme in Spain. A qualitative method is used to study the point of view of the managers, who play a fundamental role in the Programme. The results of the study relate to the cluster configuration process, the favourable factors and limitations of their activity, the overall evaluation of the programme, and the identification of proposals for improvement.